Abstract

We empirically evaluate a stochastic annealing strategy for Bayesian posterior
optimization with variational inference. Variational inference is a deterministic approach to
approximate posterior inference in Bayesian models in which a typically non-convex
objective function is locally optimized over the parameters of the approximating
distribution. We investigate an annealing method for optimizing this objective with
the aim of finding a better local optimal solution and compare with deterministic
annealing methods and no annealing. We show that stochastic annealing can provide
clear improvement on the GMM and HMM, while performance on LDA tends to favor
deterministic annealing methods.


1 Introduction 


Machine learning has produced a wide variety of useful tools for addressing a number of
practical problems, often those which involve large-scale datasets. Indeed, a number of
disciplines ranging from recommender systems to bioinformatics rely on machine intelligence
to extract useful information from their datasets in an efficient manner. One of the core
machine learning approaches to such tasks is to define a prior over a model on data and infer
the model parameters through posterior inference (Blei, 2014). The gold standard in this
direction is Markov chain Monte Carlo (MCMC), which gives a means for collecting samples
from this posterior distribution in an asymptotically correct way (Robert & Casella, 2004).
A frequent criticism of MCMC is that it is not scalable to large data sets, though recent
work has begun to address this (e.g., Welling & Teh, 2011; Maclaurin & Adams, 2014).
Instead, variational methods (Wainwright & Jordan, 2008) are proposed as an alternative for
approximating the posterior distribution of a model more quickly by turning inference into
an optimization problem over an objective function. Though the learned distribution is not
as technically correct as the empirical distribution constructed from MCMC samples, fewer
iterations are required and ideas from stochastic optimization are immediately applicable for
large-scale inference (Hoffman et al., 2013).


However, a significant issue faced by variational inference methods is that the objective is 
usually highly non-convex, and so only locally optimal solutions of the posterior approximation 
can be found. One response to this problem is to simply rerun the optimization from various 
random starting points and select the best local optimal solution. This opens variational 
inference up to the same criticisms as MCMC, since the cumulative number of iterations 
performed by variational inference may be comparable to a single chain of MCMC. Therefore, 
the advantage of scalability with variational inference is significantly reduced. 


Since variational inference is an instance of non-convex optimization, addressing this
local optimum problem with existing annealing approaches is a promising direction.
Deterministic annealing has been studied for variational inference both formally (Katahira
et al., 2008; Yoshida & West, 2010; Abrol et al., 2014) and informally (Beal, 2003). These
approaches deterministically inflate the variational entropy term, with the inflation
shrinking at each iteration, to allow for exploration of the variational objective in early
iterations. Quantum annealing has been studied as well (Sato et al., 2009).


Another long-studied annealing approach involves stochastic processes, which have been
introduced and analyzed in the context of global minimization of a non-convex function (Benzi
et al., 1982; Kirkpatrick et al., 1983; Cerny, 1985; Geman & Hwang, 1986). Though the
conditions for finding such a global optimum may be impractical, the resulting theoretical
insights have suggested practical methods for finding better local optimal solutions than found
by their non-annealed counterparts. Unlike deterministic annealing, stochastic annealing
appears to have been overlooked for variational inference. With this motivation, the goal
of this paper is to develop a stochastic annealing algorithm for variational inference and
compare its performance with deterministic and non-annealed optimization.


We demonstrate that, like deterministic annealing, improving the performance of variational
inference without compromising its scalability is possible using stochastic annealing. Our
approach is inspired by the method of simulated annealing (Kirkpatrick et al., 1983), which
prevents the gradient steps from getting trapped in a bad local optimum early on. We show
that this approach can improve the performance of variational inference on several models,
often improving over deterministic annealing.


The rest of the paper is organized as follows. In Section 2, we present an overview of annealing
for optimization and how it can be connected to variational inference. In Section 3 we present
our method in the context of conjugate exponential family models. In Section 4 we validate
our approach with three models: latent Dirichlet allocation (Blei et al., 2003), the hidden
Markov model (Rabiner, 1989) and the Gaussian mixture model (Bishop, 2006).


2 Background 

2.1 Variational inference 


Given data $X$ and a model with variables $\Theta = \{\theta_i\}$, the goal of posterior
inference is to find $p(\Theta|X)$. This is intractable in most models and so approximate
methods are used. Mean-field variational inference (Wainwright & Jordan, 2008) performs this
task by proposing a simpler factorized distribution $q(\Theta) = \prod_i q(\theta_i)$ to
approximate $p(\Theta|X)$ by minimizing their KL-divergence. This is equivalently done by
maximizing the objective function

$$\mathcal{L} = \mathbb{E}_q[\ln p(X,\Theta)] - \mathbb{E}_q[\ln q]. \qquad (1)$$

Computing the objective only requires the joint likelihood $p(X,\Theta)$, which is known by
definition. The objective is a function of the parameters of each $q(\theta_i)$,
$\lambda = \{\lambda_i\}$.
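
The equivalence between minimizing the KL-divergence and maximizing the objective follows from a standard identity, sketched here with $\Theta$ denoting the model variables and $\mathcal{L}$ the objective in (1). Because $\ln p(X)$ does not depend on $q$, any increase in $\mathcal{L}$ is matched by an equal decrease in the KL-divergence.

```latex
\begin{align*}
\mathrm{KL}\big(q \,\|\, p(\Theta|X)\big)
  &= \mathbb{E}_q[\ln q] - \mathbb{E}_q[\ln p(\Theta|X)] \\
  &= \mathbb{E}_q[\ln q] - \mathbb{E}_q[\ln p(X,\Theta)] + \ln p(X) \\
  &= \ln p(X) - \mathcal{L}.
\end{align*}
```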

The function $\mathcal{L}$ can be optimized using gradient ascent on the parameters of $q$,
which for step $t$ can be written as

$$\lambda_{t+1} \leftarrow \lambda_t + \gamma_t \nabla_\lambda \mathcal{L}|_{\lambda_t}, \qquad (2)$$

where $\gamma_t$ is a step size. In practice this gradient step is usually taken for each
$\lambda_i$ separately, holding the others fixed, rather than for the entire set $\lambda$.


Solving the approximate inference problem with variational methods has proved useful in a
number of applications, but one important shortcoming is that the updates in (2) are only
guaranteed to converge to a local maximum for non-convex problems. It is therefore important
to develop optimization procedures that find better local optima, or even the global
optimum. The updates in (2) arise in many problems where optimization by gradient methods
is used; in these areas, research has been done toward finding better local optima that can
be modified for application to variational inference as well (Benzi et al., 1982;
Kirkpatrick et al., 1983; Cerny, 1985; Geman & Hwang, 1986).

2.2 Simulated annealing


Since (2) always steps in the direction of the gradient, the result will get trapped in a local
optimum that is highly dependent on the initialization. One way to overcome this problem
is the method of simulated annealing (Kirkpatrick et al., 1983). The basic idea here is to
instead make the update

$$\lambda_{t+1} \leftarrow \lambda_t + \gamma_t \nabla_\lambda \mathcal{L}|_{\lambda_t} + T_t \epsilon_t, \qquad (3)$$

where $\epsilon_t$ is a random noise vector controlled by a “temperature” variable $T_t > 0$
that converges to zero as $t \to \infty$. The update is then accepted or rejected in a manner
similar to Metropolis-Hastings MCMC. The idea is that in the initial steps the value of $T_t$
is large enough to prevent $\lambda_t$ from getting trapped in a local maximum (i.e.,
$\lambda_t$ is volatile enough to escape from the local maximum due to the high temperature
$T_t$). As the temperature decreases, the movement is more restricted to being “uphill” until
the sequence eventually converges.
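
The noisy update with an accept/reject test can be illustrated on a toy problem. The Python sketch below is only an illustration under stated assumptions: the one-dimensional `objective`, the cooling schedule `temp0 / t`, the step size and the acceptance rule are all hypothetical choices made here, and the drifted proposal means this is a Metropolis-style heuristic rather than an exact Metropolis-Hastings sampler.

```python
import math
import random


def objective(x):
    # Hypothetical 1-D objective: a local maximum near x = -1 and a higher one near x = 2.
    return math.exp(-(x + 1.0) ** 2) + 2.0 * math.exp(-(x - 2.0) ** 2)


def gradient(x):
    # Analytic derivative of `objective`.
    return (-2.0 * (x + 1.0) * math.exp(-(x + 1.0) ** 2)
            - 4.0 * (x - 2.0) * math.exp(-(x - 2.0) ** 2))


def annealed_ascent(x0, steps=2000, step_size=0.05, temp0=1.0, seed=0):
    """Gradient ascent with temperature-scaled Gaussian noise and an accept/reject test."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        temp = temp0 / t  # assumed cooling schedule, so the temperature goes to zero
        proposal = x + step_size * gradient(x) + temp * rng.gauss(0.0, 1.0)
        delta = objective(proposal) - objective(x)
        # Uphill moves are always accepted; downhill moves are accepted with a
        # probability that shrinks as the temperature cools.
        if delta >= 0.0 or (temp > 0.0 and rng.random() < math.exp(delta / temp)):
            x = proposal
    return x


plain = annealed_ascent(-1.5, temp0=0.0)     # no noise: ordinary gradient ascent
annealed = annealed_ascent(-1.5, temp0=1.0)  # noisy early steps, then cooling
```

With `temp0=0.0` the routine reduces to plain gradient ascent and settles at the nearby local maximum. Whether the annealed run ends at the higher maximum depends on the seed and the schedule, which is the practical difficulty discussed in the remainder of this section.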


Simulated annealing was first used for discrete variables (Kirkpatrick et al., 1983; Cerny,
1985; Geman & Geman, 1984). This was later extended to continuous random variables and
analyzed in the context of continuous-time processes (Geman & Hwang, 1986), which results
in the following Langevin-type Markov diffusion,

$$d\lambda(t) = \nabla\mathcal{L}\,dt + T(t)\,d\epsilon(t), \qquad (4)$$

where $\epsilon(t)$ is a standard multi-dimensional Brownian motion. Geman & Hwang (1986) and
Chiang et al. (1987) showed how, under certain conditions, this process concentrates at the
global maximum of $\mathcal{L}$ as $T \to 0$. Kushner (1987) and Gelfand & Mitter (1991, 1993)
later developed discrete-time versions of this that have the same convergence property. Ideas
related to simulated annealing have proved useful in machine learning research from the
perspective of MCMC sampling. For example, in Hamiltonian Monte Carlo (Neal, 2010) and
sampling with gradient Langevin dynamics (Welling & Teh, 2011; Ahn et al., 2012), gradient
information is combined with noise to produce more efficient sampling.

These results suggest that simulated annealing can significantly improve the performance of
gradient-based optimization. With that said, they are also limited in the sense that (i) the
injected noise is restricted to have a Gaussian distribution, and (ii) choosing the optimal
cooling function $T(t)$ is often impractical. In addition, in the variational setting
evaluating the objective function to accept or reject may be a time-consuming procedure. To
this end, modified annealing procedures that are outside the realm of provable convergence
may still be useful for practical problems (Geman & Hwang, 1986), and one may trade guaranteed
convergence for practicality. In this case, the global optimum is traded for a better local
optimum than those found by non-annealed gradient ascent.


3 Annealing for Variational Inference 

We describe our “practical” modification to the globally convergent simulated annealing 
algorithm in the context of variational inference for conjugate exponential models. 


3.1 Variational inference for CEF models 


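Variational inference for conjugate exponential family (CEF) models, in which $q$ is in the
same family as the prior and $\lambda_i$ is the natural parameter for $q(\theta_i)$, allows the
gradient $\nabla_{\lambda_i}\mathcal{L}$ to be written in a simple form,

$$\nabla_{\lambda_i}\mathcal{L} = -\left(\frac{\partial^2 \ln q(\theta_i)}{\partial\lambda_i\,\partial\lambda_i^T}\right)\left(\mathbb{E}_q[t] + \lambda_0 - \lambda_i\right). \qquad (5)$$

The vector $\mathbb{E}_q[t]$ is the expected sufficient statistics of the conditional posterior
$p(\theta_i|X,\theta_{-i})$ using all other $q$ distributions and $\lambda_0$ is from the prior
on $\theta_i$.


Using a positive definite matrix $M$, the gradient update
$\lambda_i \leftarrow \lambda_i + \gamma_t M \nabla_{\lambda_i}\mathcal{L}|_{\lambda_i}$ is
globally optimal for a particular $\lambda_i$ conditioned on all other $q$ distributions when
$\gamma_t = 1$ and
$M = -(\partial^2 \ln q(\theta_i)/\partial\lambda_i\partial\lambda_i^T)^{-1}$. This corresponds
to setting the gradient in Eq. (5) to zero, which gives the familiar update

$$\lambda_i \leftarrow \mathbb{E}_q[t] + \lambda_0. \qquad (6)$$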
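
To see why this choice of step size and preconditioner recovers the familiar update, one can substitute the gradient in Eq. (5) into the preconditioned step. The short derivation below is a sketch; $H_i$ is shorthand introduced here for the Hessian term and is not notation from the original text.

```latex
\begin{align*}
H_i &:= \frac{\partial^2 \ln q(\theta_i)}{\partial\lambda_i\,\partial\lambda_i^T},
  \qquad M = -H_i^{-1}, \qquad \gamma_t = 1, \\
\lambda_i + M \nabla_{\lambda_i}\mathcal{L}
  &= \lambda_i + \big(-H_i^{-1}\big)\big(-H_i\big)\left(\mathbb{E}_q[t] + \lambda_0 - \lambda_i\right) \\
  &= \lambda_i + \mathbb{E}_q[t] + \lambda_0 - \lambda_i \\
  &= \mathbb{E}_q[t] + \lambda_0 .
\end{align*}
```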


To develop stochastic annealing, our approach is to modify this update in a manner similar to
the transition from Eq. (2) to Eq. (3).


Variational inference also requires initializing the variational parameters of each
$q(\theta_i)$ distribution. In this paper, we assume that each $\lambda_i$ is initialized
randomly in an appropriate way.


3.2 Deterministic annealing for VI 


Deterministic annealing has been proposed for variational inference (Katahira et al., 2008).
This gives a general framework for annealing the variational objective function that does
not involve any randomness. With deterministic annealing, a trade-off is made between the
entropy and the expected log joint likelihood to avoid being trapped in a bad local optimum
early on. This is done by multiplying the entropy term in the variational lower bound by a
“temperature” parameter $T_t > 1$,

$$\mathcal{L} = \mathbb{E}_q[\ln p(X,\Theta)] - T_t\,\mathbb{E}_q[\ln q].$$

In early iterations (indexed by $t$), larger values of $T_t$ favor smoother distributions
because such distributions have higher entropy, and thus a higher value for the objective
function. As the number of iterations increases, $T_t$ is gradually lowered (or “cooled”),
which lets the variational distribution fit to the data. This way, better values for the
variational parameters can be obtained.

We can take the derivative of the lower bound with respect to $\lambda_i$ to find the optimal
update. This gives

$$\nabla_{\lambda_i}\mathcal{L} = -\left(\frac{\partial^2 \ln q(\theta_i)}{\partial\lambda_i\,\partial\lambda_i^T}\right)\left(\mathbb{E}_q[t] + \lambda_0 - T_t\lambda_i\right). \qquad (7)$$

Pre-multiplying by $M = -(\partial^2 \ln q(\theta_i)/\partial\lambda_i\partial\lambda_i^T)^{-1}$
as before and setting the result to zero gives

$$\lambda_i \leftarrow \frac{1}{T_t}\left(\mathbb{E}_q[t] + \lambda_0\right). \qquad (8)$$

As is evident, deterministic annealing down-weights the amount of information in the posterior,
thus increasing the entropy, but the information it does incorporate is determined by the
data.


3.3 Stochastic annealing for VI 

Motivated by Eq. (3), we propose a different approach to annealing the variational objective
function. Similar to that equation, we propose the annealing update

$$\lambda_i \leftarrow \lambda_i + \gamma_t M \nabla_{\lambda_i}\mathcal{L} + T_t\epsilon_t. \qquad (9)$$

We chose the form of the preconditioning matrix $M$ and the noise $\epsilon_t$ out of
convenience, and also re-parameterize $T_t$ as follows,

$$M = -\left(\frac{\partial^2 \ln q(\theta_i)}{\partial\lambda_i\,\partial\lambda_i^T}\right)^{-1}, \qquad (10)$$

$$T_t = \gamma_t\rho_t, \qquad \epsilon_t = \eta_t - \mathbb{E}_q[t] - \lambda_0. \qquad (11)$$

We set $\rho_t$ to be a step size that shrinks to zero as $t$ increases and discuss the random
vector $\eta_t$ shortly. Using the optimal setting of $\gamma_t$ for conjugate exponential
models discussed above, we set $\gamma_t = 1$ for all $t$, which gives the convenient update

$$\lambda_i \leftarrow (1-\rho_t)\left(\mathbb{E}_q[t] + \lambda_0\right) + \rho_t\eta_t. \qquad (12)$$
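
The step from Eq. (9) to Eq. (12) can be checked by direct substitution. The lines below are a sketch of that algebra, using the fact from Section 3.1 that this choice of $M$ gives $M\nabla_{\lambda_i}\mathcal{L} = \mathbb{E}_q[t] + \lambda_0 - \lambda_i$.

```latex
\begin{align*}
\lambda_i + \gamma_t M \nabla_{\lambda_i}\mathcal{L} + T_t\epsilon_t
  &= \lambda_i + \left(\mathbb{E}_q[t] + \lambda_0 - \lambda_i\right)
     + \rho_t\left(\eta_t - \mathbb{E}_q[t] - \lambda_0\right)
     && (\gamma_t = 1,\ T_t = \rho_t) \\
  &= (1-\rho_t)\left(\mathbb{E}_q[t] + \lambda_0\right) + \rho_t\eta_t .
\end{align*}
```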


In contrast to simulated annealing, and similar to Welling & Teh (2011), we assume that
all updates are accepted with probability one to significantly accelerate inference. The step
size $\rho_t$ decreases to zero, and in this paper we assume that $\rho_t = 0$ for all
$t > T$, with the cutoff $T$ preset. Therefore, this assumption does not impact convergence of
the algorithm to a local optimal solution. We evaluate the quality of assuming probability-one
acceptance in our experiments.


We see that there is some relationship between stochastic and deterministic annealing. In
deterministic annealing, $T_t > 1$ and decreases to one. The value $T_t = (1-\rho_t)^{-1}$ is
one possible setting, and so the first term in Eq. (12) can be viewed as exactly deterministic
annealing. In addition, we introduce a random term, which has the effect of again reducing
the entropy of $q$, but to a perturbed location that allows for exploration of the objective
function similar to deterministic annealing.


We observe that this annealing method requires setting $\eta_t$ at each iteration. Recalling
that $\lambda_i$ is randomly initialized according to an appropriate method, we propose
generating $\eta_t$ according to the same random initialization. In this case, each update has
the intuitive interpretation of being a weighted combination of the true model updates and a
brand new random initialization. As $t$ increases, the weight of the initialization decreases
to zero until the correct updates are used exclusively. We present an outline of this
simulated annealing method in Algorithm 1.

Algorithm 1 An annealing algorithm for VI
  Require: a conjugate exponential model with q distributions in the same family as the prior.
  Randomly initialize the natural parameters $\lambda_i$ of each $q(\theta_i)$.
  for each $q(\theta_i)$ in iteration $t$ do
      Set the step size $\rho_t$.
      Calculate the expected sufficient statistics $\mathbb{E}_q[t]$.
      Generate a new random initialization $\eta_{i,t}$ for $\lambda_i$.
      Update $\lambda_i \leftarrow (1-\rho_t)(\mathbb{E}_q[t] + \lambda_0) + \rho_t\eta_{i,t}$.
  end for
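
As a rough illustration of Algorithm 1, the following Python sketch runs the annealed update over a list of natural-parameter vectors. It is not the authors' implementation: the callbacks `expected_stats_fn`, `random_init_fn` and `rho_fn` are hypothetical names introduced here, standing in for the model-specific expected sufficient statistics, the random initialization and the step-size schedule.

```python
import random


def stochastic_annealing_vi(expected_stats_fn, prior, random_init_fn, rho_fn, num_iters):
    """Sketch of Algorithm 1 for natural parameters stored as lists of floats.

    expected_stats_fn(i, lam) -> expected sufficient statistics for factor i
    prior[i]                  -> prior natural parameter for factor i
    random_init_fn(i)         -> a fresh random initialization for factor i
    rho_fn(t)                 -> step size in [0, 1], eventually zero
    All four are hypothetical placeholders for model-specific code.
    """
    lam = [random_init_fn(i) for i in range(len(prior))]
    for t in range(1, num_iters + 1):
        rho = rho_fn(t)
        for i in range(len(lam)):
            stats = expected_stats_fn(i, lam)
            eta = random_init_fn(i)
            # Weighted average of the standard update and a new initialization.
            lam[i] = [(1.0 - rho) * (s + p) + rho * e
                      for s, p, e in zip(stats, prior[i], eta)]
    return lam


# Toy usage: fixed "expected statistics", so the standard update is known exactly.
rng = random.Random(0)
toy_stats = [[3.0, 1.0], [0.5, 2.5]]
toy_prior = [[0.1, 0.1], [0.1, 0.1]]
result = stochastic_annealing_vi(
    expected_stats_fn=lambda i, lam: toy_stats[i],
    prior=toy_prior,
    random_init_fn=lambda i: [rng.random(), rng.random()],
    rho_fn=lambda t: 0.9 ** t if t <= 20 else 0.0,  # annealing switched off after t = 20
    num_iters=25,
)
```

Once the step size reaches zero the loop reduces to the standard update, so in this toy case `result` equals `toy_stats` plus `toy_prior` elementwise. In a real model the expected statistics would change with the other factors at every iteration.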


4 Experiments 


We evaluate our annealing approach for variational inference using three models: latent
Dirichlet allocation, the discrete hidden Markov model and the Gaussian mixture model. We
compare the performance of stochastic annealing (stochAVI) with deterministic annealing
(detAVI) and no annealing (VI). For deterministic annealing, we follow the approach of
Katahira et al. (2008), with the specific extension of this to LDA discussed in Abrol et al.
(2014).


We describe the setup, annealing strategy and results for each of these models below. In
each section, we first briefly review the problem setup, including the model variables and
selected $q$ distributions. We then discuss how our annealing approach can be applied to the
problem. Finally, we discuss the results on the model. For all experiments we set
$\rho_t = 0.9^t$ for stochastic annealing and $T_t = 5(1-\rho_t)^{-1}$, which we empirically
found to give results representative of the two methods; we note that the performance did not
change significantly around the numbers 0.9 and 5. We mention that, due to the minimal
overhead, the running time for stochAVI and detAVI was essentially the same as for VI.


4.1 Latent Dirichlet allocation 


Setup. We first present experiments on a text modeling problem using latent Dirichlet
allocation (LDA). We consider the four corpora indicated in Table 1. The model variables
for a $K$-topic LDA model of $D$ documents are $\Theta = \{\beta_{1:K}, \pi_{1:D}\}$. The
vector $\pi_d$ gives a distribution on $\beta$ for document $d$ and each topic $\beta_k$ is a
distribution on $V$ vocabulary words. We use the factorized $q$ distribution
$q(\beta_{1:K}, \pi_{1:D}) = \left[\prod_k q(\beta_k)\right]\left[\prod_d q(\pi_d)\right]$, and
set each to be Dirichlet, which is the same family as the prior. We set the Dirichlet prior
parameter of $\pi_d$ to $1/K$ and the Dirichlet prior parameter of $\beta_k$ to $100/V$. We
initialize all $q$ distributions by scaling up a uniform Dirichlet random vector, with the
specific scaling discussed below.


Table 1: The four corpora used in the LDA experiments and their relevant statistics.

              NIPS    ArXiv   NYT     HuffPost
# docs        2.5K    3.8K    8.4K    4K
# vocab       2.2K    5K      3.0K    6.3K
# tokens      2.5M    234K    1.2M    906K
# word/doc    1000    62      143     226


Annealing. The standard variational parameter updates for LDA involve summing expected
counts over all words and documents. This is done by introducing an additional variational
distribution $\phi_{d,n}$ on the allocation probability of the topic associated with the $n$th
word in the $d$th document. Below, we focus on the update of $q(\beta_k)$, noting that a simple
modification is required for $q(\pi_d)$. We recall that the update for the variational
parameter $\lambda_k$ of $\beta_k$ is

$$\lambda_k \leftarrow \sum_{d,n} \phi_{d,n}(k)\, w_{d,n} + \lambda_0, \qquad (13)$$

where $w_{d,n}$ is an indicator vector of length $V$ for word $n$ in document $d$. Using an
un-scaled initialization of $\eta_{k,t}/\mathrm{scale} \sim \mathrm{Dir}(1,\dots,1)$, we modify
this update to

$$\lambda_k \leftarrow (1-\rho_t)\Big(\sum_{d,n} \phi_{d,n}(k)\, w_{d,n} + \lambda_0\Big) + \rho_t\eta_{k,t}. \qquad (14)$$

As discussed above, with this update we first form the correct update to $\lambda_k$ using the
data. We then generate a new initialization for $\lambda_k$ and take a weighted average of the
two vectors using the step size $\rho_t$. We set scale $= cD/K$ for updating $q(\beta_k)$ and
scale $= c/K$ for updating $q(\pi_d)$, where $c$ is the average number of words per document.
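
The annealed topic update can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments: the helper names, the list-based data layout and the tiny corpus are assumptions made here, and the allocation probabilities `phi_k` are supplied by hand instead of being computed from the other variational factors.

```python
import random


def uniform_dirichlet(dim, rng):
    """Draw from Dir(1, ..., 1) by normalizing independent Gamma(1, 1) variables."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]


def annealed_topic_update(docs, phi_k, vocab_size, prior, rho, scale, rng):
    """Annealed update of the Dirichlet parameter of one topic, following Eq. (14).

    docs[d][n]  : vocabulary index of word n in document d
    phi_k[d][n] : variational probability that this word is assigned to topic k
    prior       : scalar Dirichlet prior parameter, e.g. 100 / V
    """
    # Standard update, Eq. (13): expected word counts for topic k plus the prior.
    standard = [prior] * vocab_size
    for doc, phi_doc in zip(docs, phi_k):
        for word, prob in zip(doc, phi_doc):
            standard[word] += prob
    # Fresh random initialization, scaled up from a uniform Dirichlet draw.
    eta = [scale * x for x in uniform_dirichlet(vocab_size, rng)]
    return [(1.0 - rho) * s + rho * e for s, e in zip(standard, eta)]


# Tiny hypothetical corpus: D = 2 documents, V = 3 vocabulary words, K = 2 topics.
docs = [[0, 1, 1], [2]]
phi_k = [[0.5, 1.0, 0.0], [0.25]]
c, D, K = 2.0, 2, 2  # c: average number of words per document
scale = c * D / K
rng = random.Random(0)
lam_standard = annealed_topic_update(docs, phi_k, 3, 0.1, 0.0, scale, rng)
lam_annealed = annealed_topic_update(docs, phi_k, 3, 0.1, 0.5, scale, rng)
```

With a step size of zero the function returns the standard update of Eq. (13). With a positive step size the total mass of the result is a weighted average of the data-driven mass and `scale`, which is why the scale is tied to the corpus size.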

Results. In Figure 1 we show plots of the final value of the variational objective as a
function of $K$ using 50 runs of each model and inference method. As is evident, stochAVI is
not uniformly better than detAVI in this problem, though both are clearly superior to VI
without annealing. Empirically, we see that the performance of stochAVI improves compared
with detAVI as the number of words per document increases, which is consistent with our
additional experiments not shown here. This indicates a regime in which our approach would
be preferable. We also observe that the annealed and non-annealed methods disagree on
the appropriate number of topics for each corpus. Since the lower bound can be used for
performing this model selection, these results indicate that the lower bound of the marginal
likelihood given by the variational objective does not necessarily peak at the same value of
$K$ as the true marginal likelihood. Since the annealed results are overall better, they can be
considered as providing better justification for choosing $K$.

Figure 1: The variational objective function vs number of topics for variational inference
using stochastic, deterministic and no annealing for LDA. In general, deterministic annealing
outperforms our method, with some exceptions. Both annealing methods significantly
outperform no annealing. We observe that annealing provides different, and possibly more
accurate information on the appropriate number of topics when using the lower bound for
model selection.


4.2 The discrete hidden Markov model


Setup. For the next experiment we considered the discrete $K$-state hidden Markov model
(HMM). The model variables are $\Theta = \{\pi, A, B\}$, where $\pi$ is an initial state
distribution, $A$ is the Markov transition matrix and the rows of matrix $B$ correspond to the
emission probability distributions for each state. All priors are Dirichlet distributions and
we therefore use Dirichlet $q$ distributions for the factorization
$q(\pi, A, B) = q(\pi)\prod_{k=1}^{K} q(A_{k,:})\,q(B_{k,:})$. For the priors on $A$ and $\pi$
we set the Dirichlet parameter to $1/K$. For the priors on $B$ we set the Dirichlet parameter
to $10/V$, where $V$ is the codebook size. As with LDA, we initialize all $q$ distributions by
scaling up a uniform Dirichlet random vector to the data size.

We evaluate the annealed and standard versions of variational inference on two datasets: a
character trajectories dataset from the UCI Machine Learning Repository, and the Million
Song Dataset (MSD). The characters dataset consists of sequences of spatial locations of
a pen as it is used to write one of 20 different characters. There are 2,858 sequences in
total, from which we held out 200 for testing (ten for each character). We quantized the
3-dimensional sequences using a codebook of size 500 learned with K-means. For MSD we
quantized MFCC features using 1024 codes and extracted sequences of length 50 from 500
different songs.


Annealing. As with LDA, the update for each $q$ involves a sum over expected counts, this
time involving the state transition probabilities learned from the forward-backward algorithm.
Very generally speaking, these updates are of the form

$$\lambda_k \leftarrow \sum_n\sum_m \phi_{nm,k} + \lambda_0, \qquad (15)$$

where $\lambda_0$ is a prior and $\phi_{nm,k}$ is a probability relating to the $m$th emission
in sequence $n$ and state $k$, which is calculated by introducing a variational multinomial
$q$ distribution on the hidden data of state transitions. Since the distributions used are the
same, the annealed modification is essentially identical to LDA. Using an un-scaled
initialization of $\eta_{k,t}/\mathrm{scale} \sim \mathrm{Dir}(1,\dots,1)$, we modify this
update to

$$\lambda_k \leftarrow (1-\rho_t)\Big(\sum_n\sum_m \phi_{nm,k} + \lambda_0\Big) + \rho_t\eta_{k,t}. \qquad (16)$$

That is, we form the correct update to $\lambda_k$ using the data, generate a new
initialization for $\lambda_k$ and then take a weighted average of the two using a step size
$\rho_t \to 0$. We again set $\rho_t = 0.25\max(0, 1 - t/50)$ and set scale $= cN/K$, where
$N$ is the number of sequences and $c$ is the expected length of a sequence.

[Figure 2 panels: (a) 5-state hidden Markov model; (b) 10-state hidden Markov model]

Figure 2: The variational objective function ($\times 10^4$) for each character: (red)
stochAVI, (black) detAVI, (blue) VI. The proposed annealing consistently converges to a better
local optimal approximation to the posterior of the hidden Markov model.

Figure 3: The variational objective function vs number of states for HMMs learned on
quantized song sequences from the Million Song Dataset. We used 500 sequences of length
50 taken from 500 songs. In general, the models learned with annealing are closer to the true
posterior than those without it. We also see that stochAVI performs better than detAVI on
this problem.


Results. In Figure 2 we show results of the variational lower bound for a 5-state and 10-state
HMM learned from the characters dataset. As is evident, stochAVI consistently converges to
a better posterior approximation than detAVI and VI. In Figure 3 we show the variational
objective function for the MSD problem as a function of the number of states. Since we
learn a joint HMM across songs, we find that a more complicated model with a larger state
space is better. Again we see that stochAVI outperforms detAVI, and that the annealed and
non-annealed methods do not perfectly agree on the ideal number of states.


4.3 The Gaussian mixture model 

Setup. For the final experiment we evaluated the performance of stochAVI on a $K$-state
Gaussian mixture model (GMM). The parameters for this model are
$\Theta = \{\pi, \mu_{1:K}, \Lambda_{1:K}\}$, which includes the mixing weights $\pi$ and the
mean and precision for each Gaussian $(\mu_k, \Lambda_k)$. We select
$q(\pi, \mu_{1:K}, \Lambda_{1:K}) = q(\pi)\prod_k q(\mu_k)q(\Lambda_k)$ as our factorization and
set them to the same form as the prior, which is Dirichlet for $\pi$ and independent normal and
Wishart distributions for $(\mu_k, \Lambda_k)$. We evaluate the three inference approaches on
the MNIST digits dataset. For this problem, we first reduced the dimensionality by projecting
the original 28 x 28 images onto their first 30 principal components. We then randomly
selected 1,000 images for each digit, 0 through 9, for training, and a separate 100 each for
testing. We learned 50 different Gaussian mixture models for values of
$K \in \{3, 6, 9, 12, 15, 18\}$ for each digit, giving a total of 3,000 experiments for each
inference method.

[Figure 4 panels: (a) Gaussian mixture (K = 6); (b) Gaussian mixture (K = 12); (c) Gaussian
mixture (K = 18); each panel compares none, det. and stoch. for each digit]

Figure 4: The variational objective function ($\times 10^5$) for each digit for a 6, 12 and 18
component Gaussian mixture model.


Annealing. Annealing for the GMM is more complicated in general than for LDA and the
discrete HMM, which are restricted to Dirichlet-multinomial distributions. Annealing for
$q(\pi)$ is straightforward, being a Dirichlet distribution, and follows the approach outlined
above: we introduce a variational distribution on the hidden cluster assignments, where
$\phi_n$ is a variational multinomial distribution on the cluster for observation $n$. Updating
the variational parameter of $q(\pi)$ is then the same as LDA. The random vector $\eta_t$ at
iteration $t$ with which this parameter is averaged corresponds to a random allocation of a
dataset of the same size to the $K$ clusters.

We give a high-level description for the more complicated $q(\mu_k)$ and $q(\Lambda_k)$ here.
We use the allocation vector $\eta_t$ to scale the initializations for each Gaussian and then
perform weighted averaging of the sufficient statistics from the data with those calculated
from the initialization. Since we deterministically initialize each $q(\Lambda_k)$ to have an
expectation of the empirical precision of the data, the update for $q(\Lambda_k)$ corresponds
to taking the correct distribution on the precision $\Lambda_k$ and shrinking it towards the
prior. In effect, after each iteration this stretches out the true update for the covariance
of each Gaussian, increasing its “reach,” but in decreasing amounts as $\rho_t \to 0$. This is
very similar to what is done by detAVI, with the exception that stochAVI incorporates
randomness.

For $q(\mu_k)$ we randomly initialize the mean by drawing from a Gaussian with the empirical
mean and covariance of the data. We initialize the precision to ten times the empirical
precision of the data. Using this initialization in our annealing scheme corresponds to an
update of $q(\mu_k)$ where the mean is approximately a linear combination of the empirical mean
of the data assigned to cluster $k$ with a new randomly initialized mean. The covariance of
$q(\mu_k)$ is approximately the true update to the covariance stretched out according to the
prior. Both updates for $q(\mu_k)$ and $q(\Lambda_k)$ increase uncertainty, allowing these
Gaussians to move around more in the initial iterations. We note that detAVI does not result in
a modification to the mean of $q(\mu_k)$, and so this is an additional feature of stochastic
annealing for the GMM.

Figure 5: Variational objective function vs iteration for 50 runs of a GMM with 6 components:
(red) stochAVI, (blue) VI. We omit detAVI since it deforms the objective, and so is only
comparable after annealing is turned off. Both stochAVI and VI are evaluated on the true
variational objective function for each iteration. A similar pattern per iteration was observed
with LDA and the HMM.


Results. In Figure 4 we show box plots of the variational objective as a function of number
of Gaussians for the three methods. Again, stochAVI outperforms detAVI and VI in that it
converges to a better local optimal solution. It also appears robust to model complexity in
that the gap in performance grows with an increasing number of Gaussians.

In Table 2 we show quantitative performance of a prediction task using the 100 testing
examples for each digit. Using a naive Bayes classifier based on the mean of the learned $q$
distributions, we average the classification accuracy for each value of $K$. Though the
methods do not achieve state-of-the-art performance on this dataset, the relative performance
is more important here, where we see a slight improvement with stochAVI, followed by
detAVI and then VI. This indicates that the improvement of $q$ can translate to an
improvement in the end task, though the improvement is not major in this case.

Table 2: Bayes classification prediction accuracy averaged over digits 0 through 9. We observe
a slight improvement in classification with effectively the same computation time.

Model       K=3     K=6     K=9     K=12    K=15
VI          0.945   0.939   0.934   0.926   0.922
detAVI      0.945   0.943   0.941   0.933   0.932
stochAVI    0.947   0.944   0.943   0.938   0.936

In Figure 5 we plot the variational objective as a function of iteration for the digit 0. We see
that stochAVI starts out with worse performance as it explores the space of the objective
function, but then converges on a better local optimal solution. We omit detAVI since it
deforms the variational objective function, meaning the curve actually decreases with iteration
since the scale of the entropy is decreasing with each iteration.


5 Conclusion 

Variational inference is a valuable tool for scalable Bayesian inference, but convergence to a
local optimal solution is a drawback that isn’t satisfactorily addressed with multiple restarts.
We have presented a method for variational inference based on simulated annealing that can
help remove the need for these restarts by allowing for convergence to a better local optimal
solution. The algorithm is based on a simple approach of averaging random initializations
with parameter updates after each iteration in a way that favors random exploration at
first, and gradually transitions to the correct, deterministic updates. We showed through
empirical evaluation that annealing can have a benefit for several standard Bayesian models
and compares favorably with existing deterministic annealing approaches.